Abstract

Background: To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront 
the biological processes that create incongruence between gene trees and the species phylogeny. Intra-specific 
gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, 
which creates incongruence between gene trees and the species tree. One approach to account for deep 
coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and 
seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for 
phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may 
be computationally prohibitive. 

Results: We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property 
for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a 
consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer 
method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus 
of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and 
guaranteeing that the estimated species tree will satisfy the Pareto property. 

Conclusions: Analyses of both simulated and empirical data sets demonstrate that the divide and conquer 
method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, 
while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method 
extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa. 



Introduction 

The rapidly growing abundance of genomic sequence data 
has revealed extensive incongruence among gene trees 
(e.g., [1,2]) that may be caused by processes such as deep 
coalescence (incomplete lineage sorting), gene duplication 
and loss, or lateral gene transfer (see [3-5]). In these cases, 
phylogenetic methods must account for and explain the 
patterns of variation among gene tree topologies, rather 
than simply assuming the gene tree topology reflects the 
relationships among species. In particular, there has been 
much recent interest in phylogenetic methods that
account for deep coalescence, which may occur in any 
sexually reproducing organisms (e.g., [6-8]). One such 
approach is the deep coalescence problem, which, given a 
collection of gene trees, seeks a species tree that minimizes 
the number of deep coalescence events [4,9]. Although the
deep coalescence problem is NP-hard [10], recent algorith- 
mic advances enable scientists to solve instances with a 
small number of taxa [11,12] and efficiently compute 
heuristic solutions for data sets with slightly more species 
[13]. Still, the heuristics are based on generic local tree 
search strategies with no performance guarantees, and 
they cannot handle enormous data sets. In this study, we 
prove that the deep coalescence problem satisfies the Pareto
consensus property. We then describe a new divide
and conquer approach, based on the Pareto property, that, 
in practice, can greatly extend the utility of existing heuris- 
tics while guaranteeing that the inferred species tree also 
has the Pareto property with respect to the input gene 
trees. 

Related work 

The deep coalescence problem is an example of a supertree
problem, in which input trees with taxonomic overlap
are combined to build a species tree that includes all
of the taxa found in the input trees (see [14]). In fact, it is
among the few supertree methods that use a biologically 
based optimality criterion. One way of evaluating super- 
tree methods is by characterizing their consensus proper- 
ties (e.g., [15,16]). The consensus tree problem is the 
special case of the supertree problem in which all the 
input trees contain the same taxa. Since all supertree pro- 
blems generally seek to retain phylogenetic information 
from the input trees, one of the most desirable consensus 
properties is the Pareto property. A consensus tree pro- 
blem satisfies the Pareto property on clusters (or triplets, 
quartets, etc.) if every cluster (or triplet, quartet, etc.) that 
is present in every input tree appears in the consensus tree 
[15-17]. Many supertree problems satisfy the Pareto prop- 
erty for clusters in the consensus setting [15,16]. However, 
this has not been shown for the deep coalescence problem.

Our contributions 

We prove that the deep coalescence consensus tree pro- 
blem satisfies the Pareto property for clusters. This result 
provides useful guidance for the species tree search. 
Instead of evaluating all possible species trees, to find the 
optimal solution we need only to examine trees that 
satisfy the Pareto property on clusters. These trees will 
all be refinements of the strict consensus of the gene 
trees. Furthermore, the Pareto property allows us to show
that the problem can be divided into smaller independent 
subproblems based on the strict consensus tree. We 
apply this property and describe a new divide and con- 
quer method, and our experiments demonstrate that this 
method can greatly improve the speed of deep coales- 
cence tree heuristics, potentially enabling efficient and 
effective estimates from inputs with several thousands of 
taxa. Future work will exploit the independence of the 
subproblems and solve these on parallel machines, which 
should result in even larger and more accurate solutions. 

Methods 

Basic definitions, notations, and preliminaries 

In this section we introduce basic definitions and notations
and then define preliminaries required for this work. For brevity
some proofs are omitted in the text but available in Additional file 1.

A graph $G$ is an ordered pair $(V, E)$ consisting of a
non-empty set $V$ of nodes and a set $E$ of edges. We
denote the set of nodes and edges of $G$ by $V(G)$ and $E(G)$,
respectively. If $e = \{u, v\}$ is an edge of a graph $G$,
then $e$ is said to be incident with $u$ and $v$. If $v$ is a node
of a graph $G$, then the degree of $v$ in $G$ is the number of
edges in $G$ that are incident with $v$.

A tree T is a connected graph with no cycles. T is 
rooted if it has exactly one distinguished node of degree 
one, called the root, and we denote it by Ro(T). The 
unique edge incident with Ro(T) is called the root edge. 

Let $T$ be a rooted tree. We define $\le_T$ to be the partial
order on $V(T)$ where $x \le_T y$ if $y$ is a node on the path
between Ro(T) and $x$. If $x \le_T y$ we call $x$ a descendant
of $y$, and $y$ an ancestor of $x$. We also define $x <_T y$ if
$x \le_T y$ and $x \ne y$; in this case we call $x$ a proper descendant
of $y$, and $y$ a proper ancestor of $x$. The set of
minima under $\le_T$ is denoted by Le(T) and its elements
are called leaves. A node is internal if it is not a leaf.
The set of all internal nodes of $T$ is denoted by $I(T)$.
Further, we will frequently refer to the subset of $I(T)$
whose degree is two, and we denote this by $I_2(T)$.

Let $X \subseteq \mathrm{Le}(T)$; we write $\overline{X}$ to denote the leaf
complement of $X$ when the tree $T$ is clear from the context,
where $\overline{X} = \mathrm{Le}(T) \setminus X$.

If $\{x, y\} \in E(T)$ and $x \le_T y$ then we call $y$ the parent of
$x$, denoted by $\mathrm{Pa}_T(x)$, and we call $x$ a child of $y$. The set
of all children of $y$ is denoted by $\mathrm{Ch}_T(y)$. If two nodes
in $T$ have the same parent, they are called siblings. The
least common ancestor (LCA) of a non-empty subset
$X \subseteq V(T)$, denoted as $\mathrm{lca}_T(X)$, is the unique smallest upper
bound of $X$ under $\le_T$.

If $e \in E(T)$, we define $T/e$ to be the tree obtained from
$T$ by identifying the ends of $e$ and then deleting $e$. $T/e$ is
said to be obtained from $T$ by contracting $e$. If $v$ is a vertex
of $T$ with degree one or two, and $e$ is an edge incident
with $v$, the tree $T/e$ is said to be obtained from $T$
by suppressing $v$.

Examples of the following definitions are shown in
Figure 1. Let $X \subseteq V(T)$; the subtree of $T$ induced by $X$,
denoted $T(X)$, is the minimal connected subtree of $T$
that contains Ro(T) and $X$. The restricted subtree of $T$
induced by $X$, denoted as $T|X$, is the tree obtained from
$T(X)$ by suppressing all nodes with degree two. The subtree
of $T$ rooted above node $v \in V(T)$, denoted as $T_v$, is
the restricted subtree induced by $\{u \in V(T) : u \le_T v\}$.

T is binary if every node has degree one or three. 
Throughout this paper, the term tree refers to a rooted 
binary tree unless otherwise stated. Also, the subscript 
of a notation may be omitted when it is clear from the 
context. 
Figure 1 Examples of tree definitions. (a) A rooted tree T with four leaves {a, b, c, d}. (b) The subtree T(X) of T induced by X where X = {a, b}. (c) The restricted subtree T|X of T induced by X.



Deep coalescence 

We define the deep coalescence cost function as demonstrated
in Figure 2. Note that our definition of the deep
coalescence cost, given in Def. 3, is somewhat different from, but
for our purposes equivalent to, its original definition (also
termed extra lineages) given in [4]. The relationship
between both definitions is shown in Additional file 1.

Throughout this section we assume T and S are trees 
over the same leaf set. 

Definition 1 (Path length). Suppose $x \le_T y$; the path
length from $x$ to $y$, denoted $\mathrm{pl}_T(x, y)$, is the number of edges
in the path from $x$ to $y$. Further, let $X \subseteq Y \subseteq \mathrm{Le}(T)$; we
extend the path length function by
$\mathrm{pl}_T(X, Y) = \mathrm{pl}_T(\mathrm{lca}_T(X), \mathrm{lca}_T(Y))$.



Definition 2 (LCA mapping). Let $v \in V(T)$; the LCA
mapping of $v$ in $S$, denoted $M_{T \to S}(v)$, is defined by
$M_{T \to S}(v) = \mathrm{lca}_S(\mathrm{Le}(T_v))$.

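Definition 3 (Deep coalescence). The deep coalescence
cost from $T$ to $S$, denoted $\mathrm{DC}(T, S)$, is

\[ \mathrm{DC}(T, S) \triangleq \sum_{\substack{\{u,v\} \in E(T) \\ u < v}} \mathrm{pl}_S\big(M_{T \to S}(u), M_{T \to S}(v)\big) \]

Using the extended path lengths, the deep coalescence
cost can be equivalently expressed as

\[ \mathrm{DC}(T, S) = \sum_{\substack{\{u,v\} \in E(T) \\ u < v}} \mathrm{pl}_S\big(\mathrm{Le}(T_u), \mathrm{Le}(T_v)\big) \]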



Figure 2 Example of deep coalescence cost definition. Example showing the deep coalescence cost from T to S. Each edge of T is
accompanied by its cost, and its corresponding path is shown on S.



Lin et al. BMC Bioinformatics 2012, 13(Suppl 10):S12
http://www.biomedcentral.com/1471-2105/13/S10/S12






Consensus tree 

Definition 4 (Consensus tree problem). Let
$f : \mathcal{T}_X \times \mathcal{T}_X \to \mathbb{R}$ be a cost function where $X$ is a leaf
set and $\mathcal{T}_X$ is the set of all trees over $X$. A consensus
tree problem based on $f$ is defined as follows.
Instance: A tuple of $n$ trees $(T_1, \ldots, T_n)$ over $X$.
Find: The set of all trees that have the minimum
aggregated cost with respect to $f$. Formally,

\[ \operatorname*{argmin}_{S \in \mathcal{T}_X} \left( \sum_{i=1}^{n} f(T_i, S) \right) \]

This set is also called the solutions for the consensus
tree instance.

Definition 5 (Deep coalescence consensus tree pro- 
blem). We define the deep coalescence consensus tree 
problem to be the consensus tree problem based on the 
deep coalescence cost function. 

Cluster and Pareto

Definition 6 (Cluster). Let $T$ be a tree; the clusters induced
by $T$, denoted $\mathrm{Cl}(T)$, is $\mathrm{Cl}(T) = \{\mathrm{Le}(T_v) : v \in V(T)\}$.
Further, $X \in \mathrm{Cl}(T)$ is called a trivial cluster if
$X = \mathrm{Le}(T)$ or $|X| = 1$; it is called non-trivial otherwise.
Let $Y \subseteq \mathrm{Le}(T)$; we say that $T$ contains (cluster) $Y$ if
$Y \in \mathrm{Cl}(T)$.

Definition 7 (Pareto on clusters). Let $P$ be a consensus
tree problem based on some cost function. We say that $P$
is Pareto on clusters if: for all instances $I = (T_1, \ldots, T_n)$ of
$P$, for all solutions $S$ of $I$ we have $\bigcap_{i=1}^{n} \mathrm{Cl}(T_i) \subseteq \mathrm{Cl}(S)$.
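Definitions 6 and 7 can be checked mechanically on small inputs. The sketch below is only an illustration: it assumes trees are written as nested tuples of string leaf labels (a representation chosen here, not in the paper), and the helper names `clusters`, `consensus_clusters`, and `displays_consensus` are hypothetical.

```python
def clusters(tree):
    """Return Cl(T) as a set of frozensets for a nested-tuple tree."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result, leaves = set(), frozenset()
    for child in tree:
        child_clusters = clusters(child)
        result |= child_clusters
        # The largest cluster of a subtree is its full leaf set.
        leaves |= max(child_clusters, key=len)
    result.add(leaves)
    return result


def consensus_clusters(trees):
    """Clusters present in every input tree (the intersection in Definition 7)."""
    return set.intersection(*(clusters(t) for t in trees))


def displays_consensus(trees, candidate):
    """True if the candidate contains every consensus cluster of the input trees."""
    return consensus_clusters(trees) <= clusters(candidate)


gene_trees = [((("a", "b"), "c"), "d"), (("a", "b"), ("c", "d"))]
good = displays_consensus(gene_trees, (("a", "b"), ("c", "d")))
bad = displays_consensus(gene_trees, ((("a", "c"), "b"), "d"))
```

Here the only non-trivial consensus cluster is {a, b}, so the second candidate fails the check because it separates a from b.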

Theorem overview 

We wish to show that the deep coalescence consensus tree
problem is Pareto on clusters. We describe a high level
structure of the proof in this section and provide necessary
supporting lemmata in the next section. The proof proceeds
by contradiction, assuming that the deep coalescence
consensus tree problem is not Pareto on clusters. By
Def. 7, the assumption implies that there exists an instance
$I = (T_1, \ldots, T_n)$, a solution $S$ for $I$, and a cluster $X \subseteq \mathrm{Le}(S)$
where $X \in \bigcap_{i=1}^{n} \mathrm{Cl}(T_i)$ but $X \notin \mathrm{Cl}(S)$. $S$ being a solution
for $I$ implies, by Def. 4, that the aggregated deep coalescence
cost $\sum_{i=1}^{n} \mathrm{DC}(T_i, S)$ is minimized. Then,
based on the existence of the cluster $X$, we edit $S$ and form
a new tree $R$ using a tree edit operation which will be
introduced in the next section. The properties of this new
operation, together with the properties of $X$ (proved in the
next section), provide the key ingredients to calculate
the changes in deep coalescence costs. With some further
arithmetic, this allows us to conclude that $R$ in fact
has a smaller aggregated deep coalescence cost, i.e.
$\sum_{i=1}^{n} \mathrm{DC}(T_i, S) > \sum_{i=1}^{n} \mathrm{DC}(T_i, R)$, hence contradicting
the assumption that $S$ is a solution for $I$.



Supporting lemmata 
Shallowest regrouping operation 

In this section we formally define the new tree edit opera- 
tion that forms the key part of the theorem. We begin 
with some useful definitions related to the depth of nodes. 
An example of this operation is shown in Figure 3. 

Definition 8 (Node depth). The depth of a node $v \in V(T)$,
denoted $\mathrm{dep}_T(v)$, is $\mathrm{pl}_T(v, \mathrm{Ro}(T))$.

Definition 9 (Shallowest nodes). Let $T$ be a tree and
$X \subseteq V(T)$; the shallowest function, denoted $\mathrm{shallowest}_T(X)$,
is the set of nodes in $X$ which have the minimum
depth among all nodes in $X$. Formally, we define
$\mathrm{shallowest}_T(X) = \operatorname*{argmin}_{v \in X} \mathrm{dep}_T(v)$.

Now we have the necessary mechanics to define the
new tree edit operation. In what follows, we assume $S$
to be a tree, $\emptyset \subset X \subset \mathrm{Le}(S)$, and $S' = S(\overline{X})$.

Definition 10 (Regroup). Let $v \in I_2(S')$. The regrouping
operation of $S$ by $X$ on $v$, denoted $\Gamma(S, X, v)$, is the
tree obtained from $S'$ by

1. (R1) Identify $\mathrm{Ro}(S|X)$ and $v$. In other words we
adjoin the root of tree $S|X$ onto the node $v$.

2. (R2) Suppress all nodes with degree two.

Definition 11 (Shallowest regroup). The shallowest
regrouping operation of $S$ by $X$, denoted $\Gamma(S, X)$,
defines a set of trees by
$\Gamma(S, X) = \{\Gamma(S, X, v) : v \in \mathrm{shallowest}_{S'}(I_2(S'))\}$.

As Figure 3 shows, the shallowest regrouping operation
pulls apart $X$ from $S$ and regroups $X$ back onto
each of the shallowest degree-two nodes in $S'$.

Counting the number of degree-two nodes
The regrouping operation includes the step of suppres- 
sing nodes with degree two. Since this step affects path 
lengths and ultimately deep coalescence costs, we are 
required to count carefully the number of degree-two 
nodes under various conditions. Here we assume that T 
is a tree and {X, Y} is a bipartition of Le(T). We begin 
with two observations that assert existence of degree- 
two nodes, and assert existence of leaf sets given a 
degree-two node. 

Observation 1. $I_2(T(X)) \ne \emptyset$ and $I_2(T(Y)) \ne \emptyset$.

Observation 2. If $v \in I_2(T(X))$, then $\mathrm{Le}(T_v) \cap X \ne \emptyset$
and $\mathrm{Le}(T_v) \cap Y \ne \emptyset$.

The next lemma says that if the root of $T$ is the parent
of $\mathrm{lca}(X)$, then the number of degree-two nodes in
$T(X)$ is at least the depth of $v$, where $v$ is a shallowest
degree-two node of $T(Y)$.

Lemma 1. If $\mathrm{Pa}(\mathrm{lca}(X)) = \mathrm{Ro}(T)$ and
$v \in \mathrm{shallowest}(I_2(T(Y)))$, then $\mathrm{dep}(v) \le |I_2(T(X))|$.

Proof. Assume the premise. Let $n = \mathrm{dep}(v)$; we observe
that $n \ge 1$ because of the root edge. Figure 4 shows the
setup and variable assignments for this proof. Let $v = v_1 < \ldots < v_n$,
and let $B, A_1, \ldots, A_n$ be the leaf sets of the indicated
subtrees. We observe the following:
Figure 3 Example of the shallowest regrouping operation. Example of the shallowest regrouping operation of S by X where X = {a, c, d}.
The intermediate tree $S' = S(\overline{X})$ shows its two shallowest degree-two nodes $v_1$ and $v_2$. $R_1$ and $R_2$ are the resulting trees of this operation.
That is, $\Gamma(S, X) = \{R_1, R_2\}$ where $R_1 = \Gamma(S, X, v_1)$ and $R_2 = \Gamma(S, X, v_2)$.

Figure 4 Setup and variable assignments for the proof of Lemma 1. Tree showing the variable assignments in the proof of Lemma 1.
Dotted lines represent omitted parts of the tree, and triangles represent subtrees.
• $v_n = \mathrm{lca}(X)$ because $\mathrm{Pa}(\mathrm{lca}(X)) = \mathrm{Ro}(T)$.

• $B \subseteq X$ and $A_1 \cap Y \ne \emptyset$, because $v$ is a degree-two
node of $T(Y)$.

• $A_1 \cap Y \ne \emptyset$ implies that $A_2, \ldots, A_n$ each contains at
least an element of $Y$. For otherwise, one of $v_2, \ldots, v_n$
becomes a degree-two node in $T(Y)$, contradicting
the assumption that $v = v_1$ is the shallowest degree-two
node in $T(Y)$.

In order to obtain $T(X)$, we must prune subtrees in $A_1$
whose leaves are in $Y$ (which could be the entire subtree
$A_1$). Thus there must be at least one degree-two node in
$A_1$ (or $v_1$ if $A_1$ is pruned). Similarly, for $1 < i \le n$, either $v_i$
has degree two or there exists a degree-two node in $A_i$.
Overall $T(X)$ has at least $n$ degree-two nodes, as required. □

Properties of the regrouping operation
We examine some properties of the regrouping operation
in this section. In general, these properties show
that the path lengths defined by LCAs do not increase
under several different assumptions. This preservation
of path lengths will later assist in the calculation of
deep coalescence costs. Throughout this section, we
assume $S$ to be a tree, $\emptyset \subset X \subset \mathrm{Le}(S)$, and $S' = S(\overline{X})$.
Further we let $R = \Gamma(S, X, v)$ where $v \in I_2(S')$.

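Lemma 2. If $A \subseteq B \subseteq \mathrm{Le}(S)$ and $B \subseteq \overline{X}$, then
$\mathrm{pl}_S(A, B) = \mathrm{pl}_{S'}(A, B)$.

Lemma 3. If $A \subseteq B \subseteq \mathrm{Le}(S)$ and $B \subseteq \overline{X}$, then
$\mathrm{pl}_S(A, B) \ge \mathrm{pl}_R(A, B)$.

Lemma 4. If $A \subseteq B \subseteq \mathrm{Le}(S)$ and $B \subseteq X$, then
$\mathrm{pl}_S(A, B) \ge \mathrm{pl}_R(A, B)$.

Lemma 5. If $A \subseteq B \subseteq \mathrm{Le}(S)$, $A \subseteq \overline{X}$, and $X \subseteq B$,
then $\mathrm{pl}_S(A, B) \ge \mathrm{pl}_R(A, B)$.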

Proof. Let $S''$ be the tree obtained from $S'$ by identifying
$\mathrm{Ro}(S|X)$ and $v$. In other words, $S''$ is the tree after step
(R1) of the regroup operation $\Gamma(S, X, v)$. We will show
that $\mathrm{pl}_S(A, B) \ge \mathrm{pl}_{S''}(A, B) \ge \mathrm{pl}_R(A, B)$. We begin with
the first inequality. First, since $A \subseteq \overline{X}$ we know that
$\mathrm{lca}_S(A) = \mathrm{lca}_{S'}(A) = \mathrm{lca}_{S''}(A)$. Let $x = \mathrm{lca}_S(X)$ and $b = \mathrm{lca}_S(B)$;
then the assumption of $X \subseteq B$ implies $x \le_S b$. Since $v$ has
degree two in $S'$, we know that $\mathrm{Le}(S_v) \cap X \ne \emptyset$ (Observation 2),
and so $v \le_S x$. Now let $x'' = \mathrm{lca}_{S''}(X)$ and
$b'' = \mathrm{lca}_{S''}(B)$. By (R1) we have that $x'' \le_{S''} v$, and so $x'' \le_{S''} x$,
which implies $b'' \le_{S''} b$. Furthermore, $\mathrm{lca}_S(A) = \mathrm{lca}_{S''}(A)$ is
a descendant of both $b$ and $b''$ because $A \subseteq B$, and hence
$b'' \le_{S''} b$ implies that $\mathrm{pl}_S(A, B) \ge \mathrm{pl}_{S''}(A, B)$.

Next, by (R2) $R$ is obtained from $S''$ by suppressing
some nodes, therefore a path in $S''$ can only be made
shorter in $R$, hence we have $\mathrm{pl}_{S''}(A, B) \ge \mathrm{pl}_R(A, B)$.

Finally, combining the above results we have
$\mathrm{pl}_S(A, B) \ge \mathrm{pl}_R(A, B)$. □

Main theorem 

Theorem 1. The deep coalescence consensus tree problem is
Pareto on clusters.

Proof. Assume not for a contradiction; then there exists
an instance $I = (T_1, \ldots, T_n)$, a solution $S$ for $I$, and a cluster
$X \subseteq \mathrm{Le}(S)$ where $X \in \bigcap_{i=1}^{n} \mathrm{Cl}(T_i)$ but $X \notin \mathrm{Cl}(S)$. Since
$X \notin \mathrm{Cl}(S)$, $X$ must be non-trivial, therefore $\Gamma(S, X)$
does not contain $S$ and is not empty. Let $R \in \Gamma(S, X)$.
We will show that $(\forall\, 1 \le i \le n)\,(\mathrm{DC}(T_i, S) > \mathrm{DC}(T_i, R))$,
which implies $\sum_{i=1}^{n} \mathrm{DC}(T_i, S) > \sum_{i=1}^{n} \mathrm{DC}(T_i, R)$,
contradicting the assumption that $S$ is a solution for $I$.

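Let $T = T_i$ where $1 \le i \le n$; we will show that
$\mathrm{DC}(T, S) > \mathrm{DC}(T, R)$. This requires that $\mathrm{DC}(T, S) - \mathrm{DC}(T, R) > 0$,
in other words

\[ \sum_{\substack{\{u,v\} \in E(T) \\ u < v}} \Big( \mathrm{pl}_S\big(\mathrm{Le}(T_u), \mathrm{Le}(T_v)\big) - \mathrm{pl}_R\big(\mathrm{Le}(T_u), \mathrm{Le}(T_v)\big) \Big) > 0 \tag{1} \]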

Since (1) sums over all edges in T, for convenience we 
partition the edges of T and compute the differences in 
path lengths for each partition individually. Figure 5 
depicts a running example for T, S, and R where X = {a, 
b, c}. 

We identify some specific nodes in order to partition
the edges of $T$. Let $S' = S(\overline{X})$ and $w \in I_2(S')$ where $R = \Gamma(S, X, w)$.
Since $X \notin \mathrm{Cl}(S)$, $S'$ contains at least two nodes
with degree two. Let $w' \in I_2(S')$ such that $w' \ne w$; then
$S_{w'}$ contains some leaf $y \notin X$ (Observation 2).

Let $x = \mathrm{lca}_T(X)$ and $z = \mathrm{lca}_T(X \cup \{y\})$; we partition
the edges of $T$ into $\{E_1, E_2, E_3, E_4\}$ as follows.

1. $E_1 = \{\{u, v\} \in E(T) : u < v \le x\}$ = All edges under $x$

2. $E_2 = \{\{u, v\} \in E(T) : x \le u < v\}$ = Edges forming
the path from $x$ to $\mathrm{Ro}(T)$

3. $E_3 = \{\{u, v\} \in E(T) : y \le u < v \le z\}$ = Edges forming
the path from $y$ to $z$

4. $E_4 = E(T) \setminus (E_1 \cup E_2 \cup E_3)$

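We consider (1) for each of the partitions separately.
For clarity, we define the aggregated cost difference $\Sigma_i$
for partition $E_i$ as follows.

\[ \Sigma_i = \sum_{\substack{\{u,v\} \in E_i \\ u < v}} \Big( \mathrm{pl}_S\big(\mathrm{Le}(T_u), \mathrm{Le}(T_v)\big) - \mathrm{pl}_R\big(\mathrm{Le}(T_u), \mathrm{Le}(T_v)\big) \Big) \]

Hence (1) becomes

\[ \Sigma_1 + \Sigma_2 + \Sigma_3 + \Sigma_4 > 0 \tag{2} \]

Let $x' = \mathrm{lca}_S(X)$ and $p = \mathrm{pl}_S(w, x') + 1$. For each
$i \in \{1, 2, 3, 4\}$, we claim and prove the bound of $\Sigma_i$ as
follows.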

Claim 1. $\Sigma_1 \ge p$

Proof. First we observe that the difference for each
path length in this partition is $\ge 0$ (Lemma 4), so we
have $\Sigma_1 \ge 0$. Since $x' = \mathrm{lca}_S(X)$, we only need to consider
the subtree $S_{x'}$ in computing the path lengths in
this partition. Define $U = S_{x'}$. In particular, the number
of degree-two nodes in $U(X)$ gives us a lower bound on
the total decrease of path lengths, because these nodes
are removed to obtain $U|X$, which is a subtree of $R$.
That is, $\Sigma_1 \ge |I_2(U(X))|$. Lemma 1 applies to $U$ with
bipartition $\{X, \mathrm{Le}(U) \setminus X\}$ and the node $w$, so we have
$|I_2(U(X))| \ge \mathrm{dep}_U(w)$. The depth $\mathrm{dep}_U(w)$ is with respect
to $U$, and we relate it to a path length in $S$ by taking
away the root edge, that is $\mathrm{dep}_U(w) - 1 = \mathrm{pl}_S(w, x')$.
Finally, using the definition of $p$ we obtain
$\Sigma_1 \ge |I_2(U(X))| \ge \mathrm{dep}_U(w) = \mathrm{pl}_S(w, x') + 1 = p$.

Figure 5 Running example for the proof of Theorem 1. A running example for the proof of Theorem 1 where T is a tree in the instance
tuple I, S is an assumed solution for I, X = {a, b, c}, $S' = S(\overline{X})$, $U = S_{x'}$, and $R = \Gamma(S, X, w)$. Highlighted regions in T are the edge
partitions $E_1$, $E_2$, and $E_3$. The rest of the edges form the partition $E_4$. By counting the costs for each partition we have $\Sigma_1 = 6 - 4 = 2$, $\Sigma_2 = 1 - 3 = -2$,
$\Sigma_3 = 2 - 1 = 1$, and $\Sigma_4 = 3 - 3 = 0$. Overall we have DC(T, S) - DC(T, R) = 1.

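Claim 2. $\Sigma_2 = -p$

Proof.

\[
\begin{aligned}
\Sigma_2 &= \mathrm{pl}_S(X, \mathrm{Le}(T)) - \mathrm{pl}_R(X, \mathrm{Le}(T)) \\
&= [\mathrm{pl}_S(x', \mathrm{Ro}(T)) - 1] - [1 + \mathrm{pl}_R(w, x') + \mathrm{pl}_R(x', \mathrm{Ro}(T)) - 1] \\
&= \mathrm{pl}_S(x', \mathrm{Ro}(T)) - [1 + \mathrm{pl}_R(w, x') + \mathrm{pl}_R(x', \mathrm{Ro}(T))] \\
&= \mathrm{pl}_S(x', \mathrm{Ro}(T)) - [1 + \mathrm{pl}_S(w, x') + \mathrm{pl}_S(x', \mathrm{Ro}(T))] \\
&= -[\mathrm{pl}_S(w, x') + 1] \\
&= -p
\end{aligned}
\]

The fourth equality holds because $w$ is the shallowest
degree-two node in $S'$, so that no edges along the path
from $w$ to $x'$ are contracted in $R$, hence
$\mathrm{pl}_R(w, x') = \mathrm{pl}_S(w, x')$.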

Claim 3. $\Sigma_3 \ge 1$

Proof. Let $\{a, b\} \in E_3$ where $a <_T b$, $A = \mathrm{Le}(T_a)$, and
$B = \mathrm{Le}(T_b)$. We know that $A \subseteq \overline{X}$ because otherwise
this edge should be in $E_1$ or $E_2$. We consider two cases
for $B$.

1. If $B \subseteq \overline{X}$, then Lemma 3 applies on $S, R, A, B$, so
$\mathrm{pl}_S(A, B) - \mathrm{pl}_R(A, B) \ge 0$.

2. If $X \subseteq B$, then Lemma 5 applies on $S, R, A, B$, so
$\mathrm{pl}_S(A, B) - \mathrm{pl}_R(A, B) \ge 0$.

In any case, we have $\mathrm{pl}_S(A, B) - \mathrm{pl}_R(A, B) \ge 0$ for each
edge $\{a, b\} \in E_3$. This implies that $\Sigma_3 \ge 0$. Further, since
$w' \in I_2(S')$ and $w' \ne w$, $w'$ does not exist in $R$. We also
know that $y \le_S w' \le_S \mathrm{lca}_S(X \cup \{y\})$ by the definitions of $w'$
and $y$. Therefore there exists an edge $\{a, b\} \in E_3$ such
that $\mathrm{pl}_S(A, B) - \mathrm{pl}_R(A, B) \ge 1$. Hence we have $\Sigma_3 \ge 1$.

Claim 4. $\Sigma_4 \ge 0$

Proof. Let $\{a, b\} \in E_4$ where $a <_T b$, $A = \mathrm{Le}(T_a)$, and
$B = \mathrm{Le}(T_b)$. The proof follows from the same argument
as in Claim 3, where we have $\mathrm{pl}_S(A, B) - \mathrm{pl}_R(A, B) \ge 0$
for each edge $\{a, b\} \in E_4$, hence $\Sigma_4 \ge 0$.

Finally, we have $\Sigma_1 + \Sigma_2 + \Sigma_3 + \Sigma_4 \ge p + (-p) + 1 + 0 = 1 > 0$.
Hence (2) is satisfied, and so is (1). In sum, we
have constructed a tree $R$ and showed that
$\sum_{i=1}^{n} \mathrm{DC}(T_i, S) > \sum_{i=1}^{n} \mathrm{DC}(T_i, R)$, which contradicts
the assumption that $S$ is a solution for $I$, in other
words the assumption that $S$ has the minimum aggregated
cost with respect to the deep coalescence cost
function. □
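For very small leaf sets, the statement of Theorem 1 can be checked by exhaustive search. The sketch below is an illustration only and not the paper's method: it assumes nested-tuple trees with string leaf labels, leaves the root edge implicit, and uses hypothetical helper names (`node_depths`, `dc_cost`, `all_binary_trees`, `pareto_check`). The enumeration grows as (2n-3)!! and is practical only for a handful of taxa.

```python
def node_depths(tree, depth=0):
    """List (cluster, depth) for every node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [(frozenset([tree]), depth)]
    out = []
    for child in tree:
        out += node_depths(child, depth + 1)
    # Direct children are exactly the entries recorded one level deeper.
    leaves = frozenset().union(*(c for c, d in out if d == depth + 1))
    out.append((leaves, depth))
    return out


def clusters(tree):
    return {c for c, _ in node_depths(tree)}


def dc_cost(gene_tree, species_tree):
    """Deep coalescence cost as the sum of LCA path lengths over gene tree edges."""
    s_nodes = node_depths(species_tree)

    def lca_depth(leaf_set):
        return max(d for c, d in s_nodes if leaf_set <= c)

    def walk(tree):  # returns (leaf set, cost of all edges below this node)
        if isinstance(tree, str):
            return frozenset([tree]), 0
        parts = [walk(child) for child in tree]
        leaves = frozenset().union(*(p[0] for p in parts))
        cost = sum(p[1] + lca_depth(p[0]) - lca_depth(leaves) for p in parts)
        return leaves, cost

    return walk(gene_tree)[1]


def all_binary_trees(leaves):
    """Yield every rooted binary tree on the given list of leaves exactly once."""
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    for mask in range(2 ** len(rest)):
        left = [first] + [x for i, x in enumerate(rest) if (mask >> i) & 1]
        right = [x for i, x in enumerate(rest) if not (mask >> i) & 1]
        if right:  # both sides of the root split must be non-empty
            for l in all_binary_trees(left):
                for r in all_binary_trees(right):
                    yield (l, r)


def pareto_check(gene_trees):
    """Return (best cost, optimal trees, whether every optimum has all consensus clusters)."""
    leaves = sorted(max(clusters(gene_trees[0]), key=len))
    consensus = set.intersection(*(clusters(t) for t in gene_trees))
    scored = [(sum(dc_cost(t, s) for t in gene_trees), s) for s in all_binary_trees(leaves)]
    best = min(score for score, _ in scored)
    optima = [s for score, s in scored if score == best]
    return best, optima, all(consensus <= clusters(s) for s in optima)


genes = [((("a", "b"), "c"), "d"), (("a", "b"), ("c", "d")), ((("a", "b"), "d"), "c")]
best, optima, ok = pareto_check(genes)
```

In this toy instance the three gene trees share the cluster {a, b}, and the search confirms that every minimum-cost tree keeps a and b together.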

Algorithm for improving a candidate solution 

Algorithm 1 takes a consensus tree problem instance
and a candidate solution as inputs. If the candidate solution
does not display the consensus clusters, it is transformed
into one that includes all of the consensus
clusters and has a smaller deep coalescence cost.

Algorithm 1 Deep coalescence consensus clusters builder

1: procedure DCConsensusClustersBuilder($I$, $T$)
   Input: A consensus tree problem instance $I = (T_1, \ldots, T_n)$, a candidate solution $T$ for $I$
   Output: $T$, or an improved solution $R$ that contains all consensus clusters of $I$
2:   $R \leftarrow T$
3:   $C \leftarrow$ Set of all consensus clusters of $I$
4:   for all cluster $X \in C$ do
5:     if $R$ does not contain $X$ then
6:       $v \leftarrow$ A node in $\mathrm{shallowest}(I_2(R(\overline{X})))$ (shallowest degree-two node of $R(\overline{X})$)
7:       $R \leftarrow \Gamma(R, X, v)$ (regrouping operation of $R$ by $X$ on $v$)
8:     end if
9:   end for
10:  return $R$
11: end procedure

The correctness of Algorithm 1 follows from the proof
of Theorem 1. We now analyze its time complexity. Let
$m$ be the number of taxa present in the input trees. Line
3 takes $O(nm)$ time. Lines 5, 6, and 7 each take $O(m)$
time, and there are $O(m)$ iterations. Overall Algorithm 1
takes $O(nm + m^2)$ time.

General method for improving a search algorithm 

In this section we extend the result of Theorem 1 and
show that the deep coalescence consensus tree problem
exhibits optimal substructures based on the strict consensus
tree of the problem instance. This leads to another
simple and general method that improves an existing
search algorithm. Figure 6 depicts a running example for
this section. We now begin with some useful definitions.

Definition 12 (Strict consensus tree [18]). Given a
tuple of $n$ trees $I = (T_1, \ldots, T_n)$, the strict consensus tree of
$I$, denoted $\mathrm{StrictCon}(I)$, is the unique tree that contains
those clusters common to all the input trees. Formally,
$\mathrm{StrictCon}(I)$ is a (possibly non-binary) tree $S$ such that
$\mathrm{Cl}(S) = \bigcap_{i=1}^{n} \mathrm{Cl}(T_i)$.
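Definition 12 translates directly into a small construction: intersect the cluster sets and rebuild a tree from the resulting nested family. The following sketch assumes nested-tuple trees with string leaf labels (an assumption of this example), and the names `clusters` and `strict_consensus` are hypothetical. It is quadratic in the number of clusters and is meant as an illustration, not an efficient implementation.

```python
def clusters(tree):
    """Return Cl(T) as a set of frozensets for a nested-tuple tree."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result, leaves = set(), frozenset()
    for child in tree:
        child_clusters = clusters(child)
        result |= child_clusters
        leaves |= max(child_clusters, key=len)   # full leaf set of the child
    result.add(leaves)
    return result


def strict_consensus(trees):
    """Build the (possibly non-binary) tree whose clusters are common to all input trees."""
    common = set.intersection(*(clusters(t) for t in trees))
    ordered = sorted(common, key=len)            # smallest clusters first
    children = {c: [] for c in ordered}
    for i, c in enumerate(ordered):
        # The first strictly larger superset in size order is the parent cluster,
        # because clusters shared by the trees are nested or disjoint.
        for candidate in ordered[i + 1:]:
            if c < candidate:
                children[candidate].append(c)
                break

    def to_tuple(cluster):
        if len(cluster) == 1:
            return next(iter(cluster))
        kids = sorted(children[cluster], key=sorted)  # deterministic child order
        return tuple(to_tuple(k) for k in kids)

    return to_tuple(ordered[-1])


gene_trees = [((("a", "b"), "c"), "d"), (("a", "b"), ("c", "d"))]
H = strict_consensus(gene_trees)
```

For these two gene trees only {a, b} survives as a non-trivial cluster, so the root of the consensus tree is a polytomy with three children.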

Definition 13 (Cut on trees). Let $H$ and $T$ be two
trees over the same leaf set, such that $H$ is a non-binary
tree and $T$ is a binary tree that refines $H$. Given an
internal node $h$ in $H$, a cut on $T$ via $H$ and $h$, denoted
$\mathrm{Cut}_{H,h}(T)$, is the minimal connected subtree of $T$ that
contains $\{M_{H \to T}(c) : c \in \mathrm{Ch}_H(h)\}$, and we rename each
leaf $x$ by $\mathrm{Le}(T_x)$.

We further extend this to a tuple of trees $I = (T_1, \ldots, T_n)$
by $\mathrm{Cut}_{H,h}(I) = (\mathrm{Cut}_{H,h}(T_1), \ldots, \mathrm{Cut}_{H,h}(T_n))$.
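The cut operation can be pictured as locating the subtree of $T$ that corresponds to $h$ and collapsing each child cluster of $h$ into a single renamed leaf. The sketch below assumes nested-tuple trees with string leaf labels, assumes that $T$ refines $H$, and renames a collapsed subtree by joining its sorted leaf labels with "+". The representation and the names `leaf_set`, `find_subtree`, and `cut` are assumptions of this example, and the root edge is not modeled.

```python
def leaf_set(tree):
    """Leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def find_subtree(tree, cluster):
    """Return the subtree whose leaf set equals the given cluster, or None."""
    if leaf_set(tree) == cluster:
        return tree
    if isinstance(tree, str):
        return None
    for child in tree:
        if cluster <= leaf_set(child):
            return find_subtree(child, cluster)
    return None


def cut(tree, parent_cluster, child_clusters):
    """Restrict a refinement of H to node h and collapse each child cluster of h to one leaf."""
    child_clusters = set(child_clusters)

    def collapse(node):
        leaves = leaf_set(node)
        if leaves in child_clusters:
            return "+".join(sorted(leaves))      # renamed super-leaf
        return tuple(collapse(child) for child in node)

    return collapse(find_subtree(tree, parent_cluster))


# h is the root of H = (("a", "b"), "c", ("d", "e")); its children are three clusters.
T = ((("a", "b"), "c"), ("d", "e"))
root = frozenset("abcde")
kids = [frozenset("ab"), frozenset("c"), frozenset("de")]
cut_T = cut(T, root, kids)
```

The result is a binary tree on the three super-leaves, which is the kind of small instance that each subproblem in this section operates on.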

Theorem 2. Let $I = (T_1, \ldots, T_n)$ be an instance of the
deep coalescence consensus tree problem, and let $S$ be a
solution for $I$ (having the optimal deep coalescence cost).
Further suppose $H$ is the strict consensus tree of $I$ and $h$
is an internal node in $H$. Then $\mathrm{Cut}_{H,h}(S)$ is a solution
for the instance $\mathrm{Cut}_{H,h}(I)$ of the deep coalescence consensus
tree problem.

Proof. Let $\mathrm{Cut}_{H,h}(S) = S'$ and
$\mathrm{Cut}_{H,h}(I) = (\mathrm{Cut}_{H,h}(T_1), \ldots, \mathrm{Cut}_{H,h}(T_n)) = (T'_1, \ldots, T'_n)$.

First we observe that $S$ must be a refinement of $H$ by
Theorem 1, therefore $S'$ is defined. We continue to
prove by contradiction, assuming the premise holds but
$S'$ is not a solution for the instance $\mathrm{Cut}_{H,h}(I)$. So let $R'$
be a solution for the instance $\mathrm{Cut}_{H,h}(I)$; this implies
that $\sum_{i=1}^{n} \mathrm{DC}(T'_i, S') > \sum_{i=1}^{n} \mathrm{DC}(T'_i, R')$. We now
modify $S$ by replacing $S'$ with $R'$ as follows:

1. Remove all edges of $S'$, and remove all nodes of $S'$
except the root and the leaves.

2. Identify $\mathrm{Ro}(S')$ with $\mathrm{Ro}(R')$.

3. For each leaf $v$ of $S'$, identify $v$ with a leaf $x$ of $R'$
where $x = \mathrm{Le}(S_v)$.

Let the resulting tree be $R$. We will show that $R$ has a
lower deep coalescence cost, contradicting the assumption
that $S$ is a solution for $I$.

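Let $T = T_i$ where $1 \le i \le n$; it suffices to show that
$\mathrm{DC}(T, S) > \mathrm{DC}(T, R)$, in other words

\[ \sum_{\substack{\{u,v\} \in E(T) \\ u < v}} \Big( \mathrm{pl}_S\big(M_{T \to S}(u), M_{T \to S}(v)\big) - \mathrm{pl}_R\big(M_{T \to R}(u), M_{T \to R}(v)\big) \Big) > 0 \tag{3} \]

For convenience, let $\mathrm{Ch}_H(h) = \{c_1, \ldots, c_m\}$, $h' = M_{H \to T}(h)$,
and $c'_j = M_{H \to T}(c_j)$ where $1 \le j \le m$.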
Figure 6 Running example for the definitions and proof of Theorem 2. A running example for the definitions and proof of Theorem 2.
Arrows are marked by numbers 1 to 6, demonstrating the steps of the proof. Each step is explained below: (1) Given an instance $I = (T_1, \ldots, T_n)$, let
H be the strict consensus tree of I. An internal node h of H and its four children are shown. (2) Let S be a solution for I, having the optimal deep
coalescence cost. (3) Cut trees via H and h, obtaining $\mathrm{Cut}_{H,h}(I)$ and $\mathrm{Cut}_{H,h}(S) = S'$. Let A, B, C, D be the leaf sets of each subtree. (4) We
assume by contradiction that S' is not a solution for $\mathrm{Cut}_{H,h}(I)$, and so we let R' be a solution for $\mathrm{Cut}_{H,h}(I)$. (5 and 6) Modify S to obtain
R, by replacing S' with R'.



Similar to the proof of Theorem 1, we partition the
edges of $T$ into $\{E_{under}, E_{out}, E_{in}\}$ as follows.

1. $E_{under} = \{\{u, v\} \in E(T) : u < v \text{ and } (\exists j)(v \le c'_j)\}$

2. $E_{out} = \{\{u, v\} \in E(T) : u < v \text{ and } v \not\le h'\}$

3. $E_{in} = E(T) \setminus (E_{under} \cup E_{out})$

Recall that the modification of $S$ into $R$ only involves
the subtree $S'$, therefore $M_{T \to S}(v)$ is unchanged for
every $v$ that occurs in $E_{under}$ and $E_{out}$. Hence it suffices to
evaluate (3) on $E_{in}$ only. However we have already
assumed that $\sum_{i=1}^{n} \mathrm{DC}(T'_i, S') > \sum_{i=1}^{n} \mathrm{DC}(T'_i, R')$,
therefore (3) holds. Overall we have that $R$ has a lower
deep coalescence cost, contradicting the assumption
that $S$ is a solution for $I$. □

Theorem 2 implies that every internal node of the
strict consensus tree defines an independent subproblem,
and solutions of these subproblems can be combined
to give a solution to the original deep coalescence
consensus tree problem. This leads to the following
general divide and conquer method that improves an
existing search algorithm.

Method 1 Deep coalescence consensus tree method

procedure DCConsensusTreeMethod($I$)
  Input: A DC consensus tree problem instance $I = (T_1, \ldots, T_n)$, an external program DC-SOLVER.
  Output: A candidate solution $T$ for $I$
  $H \leftarrow \mathrm{StrictCon}(I)$
  for all internal node $h$ of $H$ do
    $I_h \leftarrow \mathrm{Cut}_{H,h}(I)$
    $S_h \leftarrow$ DC-SOLVER($I_h$)
    Refine the children of $h$ on $H$ by the tree $S_h$
  end for
  return $H$
end procedure

Results 

We used simulation experiments to (i) test if the solu- 
tions obtained from efficient heuristics presented in [13] 
display the Pareto property, and (ii) compare the performance
of our new divide and conquer approach based
on the Pareto property to the generic heuristic in [13]. 

Experiment results 1 

First to examine if subtree pruning and regrafting (SPR) 
heuristic solutions from [13] display the Pareto property, 
we generated a series of four 14-taxon trees that share few 
clusters. To do this, we first generated random 11-taxon
trees. Next, we generated random 4-taxon trees containing
the species 11-14. We then replaced one of the leaves
in the 11-taxon tree with the random 4-taxon tree. This
procedure produces gene trees that share at least a single 
4-taxon cluster in common. Although this simulation does 
not reflect a biological process, it represents extreme cases 
of error or incongruence among gene trees. In three cases 
with the 14-taxon gene trees, we found that the SPR heur- 
istic did not return a result that contained the consensus 
cluster. In these cases, our proof demonstrates that there 
exists a better solution that also contains the consensus
cluster. However, the failure of the SPR heuristic in these 
cases appears to depend on the starting tree; these data 
sets did not fail with all starting trees. Thus, the shortcom- 
ings of the SPR heuristic may be ameliorated by perform- 
ing multiple runs from different starting trees. 
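The construction of the 14-taxon test trees can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the text does not say how the random trees were sampled or which leaf was replaced, so the sketch joins random pairs of subtrees and uses a hypothetical placeholder leaf as the eleventh taxon. Trees are nested tuples, and all function names are hypothetical.

```python
import random


def random_tree(labels, rng):
    """Build a random rooted binary tree by repeatedly joining two random subtrees."""
    forest = list(labels)
    while len(forest) > 1:
        i, j = rng.sample(range(len(forest)), 2)
        joined = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)]
        forest.append(joined)
    return forest[0]


def replace_leaf(tree, leaf, subtree):
    """Return a copy of the tree with one leaf swapped for a whole subtree."""
    if tree == leaf:
        return subtree
    if isinstance(tree, str):
        return tree
    return tuple(replace_leaf(child, leaf, subtree) for child in tree)


def simulate_gene_tree(rng):
    """One 14-taxon tree that is guaranteed to contain the cluster {11, 12, 13, 14}."""
    # Assumption: taxa 1-10 plus a placeholder form the 11-taxon backbone.
    backbone = random_tree([str(i) for i in range(1, 11)] + ["PLACEHOLDER"], rng)
    clade = random_tree([str(i) for i in range(11, 15)], rng)
    return replace_leaf(backbone, "PLACEHOLDER", clade)


def leaf_set(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def has_cluster(tree, cluster):
    """True if some node of the tree has exactly this leaf set."""
    if leaf_set(tree) == cluster:
        return True
    return not isinstance(tree, str) and any(has_cluster(c, cluster) for c in tree)


rng = random.Random(0)
gene_trees = [simulate_gene_tree(rng) for _ in range(4)]
```

Each generated tree shares the 4-taxon cluster with the others, while the remaining structure is unconstrained, which mirrors the "few shared clusters" design described above.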

Experiment results 2 

We next evaluated the efficacy and scalability of Method 1
and compared it to the standalone SPR heuristic. We
generated sets of gene trees, each with a different
consensus tree structure (depth and branch factor), as
follows. The depth of a tree is the maximum number of edges
from the root to a leaf, and the branch factor of a tree is
the maximum degree of its nodes. For each depth d and branch
factor b, we first generated a complete b-ary tree of depth
d, denoted C_{d,b}. This tree represents the consensus tree.
We used depths of 2-5 and branch factors of 3-30. For each
C_{d,b}, we then generated 10 sets of 20 random gene trees,
such that each gene tree is a binary refinement of C_{d,b}.
Each set of input trees was given as input to Method 1,
using [13] as the external deep coalescence solver. For
comparison, we ran the same data sets using [13] as the
standalone deep coalescence solver. We calculated the deep
coalescence score for each output species tree, and we
report the average score of the 10 profiles as the score for
each C_{d,b}. We also measured and recorded the average
runtime of each run. We terminated the execution of the
standalone solver if the runtime exceeded two minutes; in
these cases the results are not shown. In general, Figure 7
shows that the scores of the trees from Method 1 and the
standalone SPR heuristic were very similar. Thus, Method 1
does not appear to improve the quality of the deep
coalescence species trees. However, Method 1 shows extreme
improvements in runtime, especially as the branch factor
increases.
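The generation of a complete b-ary consensus tree and its random binary refinements can be sketched as follows. This is a hedged illustration rather than the authors' generator: the nested-tuple representation, the function names, and the random-joining scheme used to resolve each multifurcation are assumptions, since the text does not state how the random refinements were drawn.

```python
import itertools
import random

def complete_tree(b, d, ids=None):
    """Complete b-ary tree of depth d as nested tuples; leaves are integers."""
    ids = itertools.count() if ids is None else ids
    if d == 0:
        return next(ids)
    return tuple(complete_tree(b, d - 1, ids) for _ in range(b))

def random_binary_refinement(tree, rng):
    """Resolve every multifurcation into random binary splits.

    Each child subtree is kept intact, so every cluster of the input
    tree is also a cluster of the refinement.
    """
    if not isinstance(tree, tuple):
        return tree
    parts = [random_binary_refinement(child, rng) for child in tree]
    while len(parts) > 1:
        left = parts.pop(rng.randrange(len(parts)))
        right = parts.pop(rng.randrange(len(parts)))
        parts.append((left, right))
    return parts[0]

def leaf_count(tree):
    return sum(leaf_count(c) for c in tree) if isinstance(tree, tuple) else 1

def is_binary(tree):
    if not isinstance(tree, tuple):
        return True
    return len(tree) == 2 and all(is_binary(c) for c in tree)

rng = random.Random(1)
consensus = complete_tree(b=3, d=2)      # 3^2 = 9 leaves
profile = [random_binary_refinement(consensus, rng) for _ in range(20)]
```

Note that the strict consensus of such a profile equals the generating tree only if the refinements disagree at every multifurcation; with 20 random refinements per profile, an accidental extra shared cluster is possible but unlikely.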

Experiment results 3 

Finally, we examined the performance of Method 1 and
compared it to the standalone SPR heuristic using more
biologically plausible coalescence simulations. We followed
the general structure of the coalescence simulation protocol
described by Maddison and Knowles [9]. First, we generated
40 256-taxon species trees based on a Yule pure birth
process using the r8s software package [19]. To transform
the branch lengths from the Yule simulation to represent
generations, we multiplied them all by 1,000,000. Next, we
simulated coalescence within each species tree (assuming no
migration or hybridization) using Mesquite [20]. All
simulations produced a single gene copy from each species.
For each species tree, we simulated 20 gene trees assuming a
constant population size. The population size affects the
number of deep coalescence events, with larger populations
leading to more incomplete lineage sorting and consequently
less agreement among the gene trees. Thus, to incorporate
different levels of incomplete lineage sorting, we used a
constant population size of 10,000 for 20 of the species
trees and a constant population size of 100,000 for the
other 20. In total, we produced 40 sets of 20 gene trees,
with each set simulated from a different 256-taxon species
tree.

For each data set, we performed a phylogenetic analysis
using Method 1 and also using only the SPR heuristic from
Bansal et al. [13]. In contrast to the simulations in
Experiment 1, the standalone SPR heuristic of Bansal et al.
[13] always returned species trees with all consensus
clusters. Of course, all solutions from Method 1 must
display the Pareto property. The deep coalescence
reconciliation scores for the best trees were similar with
both algorithms. When the population size was 10,000, the
average coalescence cost was 279, and all the gene trees
shared an average of 29.4 clusters. In 19 out of the 20 of
these simulations, both approaches produced the same
results, while in one case Method 1 found a species tree
with one fewer implied deep coalescence event. When the
population size was 100,000, the average coalescence cost
was 2142, and all the gene trees shared an average of 19.1
clusters. Although the reconciliation cost never differed by
more than 15, Method 1 had a better score in 6 replicates,
and the standalone SPR had a better score in 11 replicates.
All analyses finished within 30 seconds on a laptop PC, but
Method 1 was always faster than SPR alone.
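For readers who want to reproduce the reconciliation cost reported above, the following minimal sketch computes the deep coalescence cost of one gene tree against a species tree by counting extra lineages. It assumes rooted trees given as nested tuples of string labels, one gene copy per species, and identical species sets in both trees. The function names are hypothetical, and the least common ancestor lookup is deliberately naive (quadratic) for clarity; it is not the algorithm of [13].

```python
def leafset(tree):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))

def parent_map(tree):
    """Map each node (identified by its cluster) to its parent's cluster."""
    parents = {}

    def walk(node, parent):
        cluster = leafset(node)
        parents[cluster] = parent          # the root maps to None
        if not isinstance(node, str):
            for child in node:
                walk(child, cluster)

    walk(tree, None)
    return parents

def deep_coalescence(gene, species):
    """Number of extra gene lineages summed over all species tree edges."""
    sp_parent = parent_map(species)

    def lca(cluster):
        # Smallest species cluster that contains the gene cluster.
        return min((c for c in sp_parent if cluster <= c), key=len)

    # Each species edge is keyed by the cluster of its lower (child) node.
    lineages = {c: 0 for c, p in sp_parent.items() if p is not None}
    for node, parent in parent_map(gene).items():
        if parent is None:
            continue
        low, high = lca(node), lca(parent)
        while low != high:                 # walk up the species tree
            lineages[low] += 1
            low = sp_parent[low]
    # Every edge carries at least one lineage; the rest are "extra".
    return sum(count - 1 for count in lineages.values())

species = (("a", "b"), ("c", "d"))
gene = (("a", "c"), ("b", "d"))
cost = deep_coalescence(gene, species)     # two extra lineages
```

The score of a candidate species tree for a whole profile is then the sum of this cost over all gene trees in the profile.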

Discussion 

In addition to offering a biologically informed optimality
criterion to resolve incongruence among gene trees, we
prove that the deep coalescence problem also is guaranteed
to retain the phylogenetic clusters for which all gene trees
agree. Since the deep coalescence problem is NP-hard [10],
most meaningful instances will require heuristics to
estimate a solution. We demonstrate that the Pareto property
can be leveraged to vastly improve upon the running time of
heuristics. Method 1 represents a new general approach to
phylogenetic algorithms. In most cases, heuristics to
estimate solutions for phylogenetic inference problems are
based on a few generic search strategies such as the local
search heuristics based on nearest neighbor interchange
(NNI), SPR, or tree bisection and reconnection (TBR) branch
swapping. Although these search strategies often appear to
perform well, they are not connected to any specific
phylogenetic problems or optimality criteria. Ideally,
however, efficient and effective heuristics should be
tailored to the properties of the phylogenetic problem. In
the case of the deep coalescence consensus tree problem, the
Pareto property provides an informative guiding constraint
for the tree search. Specifically, when considering possible
solutions, we need only consider solutions that contain all
clusters shared by all of the input gene trees, or, in other
words, that refine the strict consensus of the input gene
trees.

Figure 7 Deep coalescence score and runtime results for
Experiment 2. Legend: blue represents Method 1 (divide and
conquer) and orange represents standalone SPR heuristic.
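The constraint just described is easy to test for any candidate species tree. The sketch below is an illustration with hypothetical function names and a nested-tuple tree representation (both assumptions made here): it collects the clusters present in every gene tree and checks whether a candidate contains all of them.

```python
def clusters(tree):
    """Set of clusters (leaf sets of all nodes) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def consensus_clusters(gene_trees):
    """Clusters that appear in every input gene tree."""
    return set.intersection(*(clusters(t) for t in gene_trees))

def has_pareto_property(candidate, gene_trees):
    """True if the candidate contains every consensus cluster."""
    return consensus_clusters(gene_trees) <= clusters(candidate)

gene_trees = [(("a", "b"), ("c", "d")), ((("a", "b"), "c"), "d")]
keeps_ab = has_pareto_property((("a", "b"), ("c", "d")), gene_trees)
breaks_ab = has_pareto_property((("a", "c"), ("b", "d")), gene_trees)
```

Here both gene trees contain the cluster {a, b}, so the first candidate passes the check and the second, which separates a from b, does not.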

Still, our simulation experiments suggest that, in many
cases, the SPR local search heuristic described by Bansal
et al. [13] performs well. While we identified cases in
which the estimate from the SPR heuristic did not contain
the Pareto clusters, in most cases SPR alone found trees as
good as, or even slightly better than, those from Method 1.
We note that the size of the simulated coalescence data set,
256 taxa, exceeds the size of the largest published analysis
of the deep coalescence consensus tree problem and is far
beyond the largest instances (8 taxa) for which exact
solutions have been calculated [11], and the SPR heuristic
found good solutions within 30 seconds. Still, the running
time of the SPR heuristic does not always scale well, and
the results of Experiment 2 suggest that it might not be
tractable for extremely large data sets. In these cases,
Method 1 may in practice vastly improve upon the running
time, while guaranteeing a solution with the Pareto
property.

Further, Theorem 2 shows that the deep coalescence 
consensus tree problem exhibits independent optimal 
substructures. This implies that, once we compute the 
strict consensus tree of the problem instance, the rest of 
Method 1 can be directly parallelized, regardless of which 
external deep coalescence solver is used. In the case
where the external solver guarantees exact solutions, our
method would also give exact solutions, but could
potentially solve instances with a much larger number of
taxa than the external solver alone.

Although the Pareto property for the deep coalescence 
consensus tree problem is desirable, and the divide and 
conquer method is promising for large-scale analyses, 
there are limitations to their use. First, the Pareto
property and Method 1 are limited to the consensus case,
that is, instances in which all of the input gene trees
contain sequences from all of the species. Also, the Pareto
property is only useful when the input trees share some
nontrivial clusters. If there are no nontrivial consensus
clusters among the input trees, then Method 1 conveys no
run-time benefits. While this may seem like an extreme case,
it is possible with high levels of incomplete lineage
sorting, or, perhaps more likely, much error in the gene
tree estimates. Also,
as we add more and more gene trees, we would expect 
more instances of conflict among the gene trees, poten- 
tially converging towards the elimination of consensus 
clusters. Than and Rosenberg [21] recently proved the 
existence of cases in which the deep coalescence problem 
is inconsistent, or converges on the wrong species tree 
estimate with increasing gene tree data. Although
inconsistency is concerning, the Pareto property provides
some reassurance. Even in a worst-case scenario in which the
deep coalescence problem is misled, the optimal solutions
will still contain all of the agreed upon clades from the
gene trees. Perhaps the greatest advantage of the deep
coalescence problem, especially compared to likelihood and
Bayesian approaches that infer species trees based on
coalescence models (e.g., [22-24]), is its computational
speed and the feasibility of estimating a species tree from
large-scale genomic data sets representing hundreds or even
thousands of taxa [13]. Not only can our method improve the
performance of any existing heuristic, but the Pareto
property also describes a limited subset of possible species
trees that must contain the optimal solution.
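To give a sense of how limited this subset can be, the sketch below compares the number of all rooted binary trees on n leaves with the number of binary refinements of a given consensus tree. It relies on the standard counting result that there are (2k-3)!! rooted binary trees on k labeled leaves; that formula, the function names, and the nested-tuple representation are not taken from this paper and are used here only for illustration.

```python
def num_rooted_binary_trees(k):
    """(2k-3)!!: number of rooted binary trees on k labeled leaves."""
    count = 1
    for factor in range(3, 2 * k - 2, 2):
        count *= factor
    return count

def num_binary_refinements(tree):
    """Number of binary refinements of a nested-tuple consensus tree.

    Each node with k children can be resolved independently in
    (2k-3)!! ways, so the counts multiply over all internal nodes.
    """
    if not isinstance(tree, tuple):
        return 1
    total = num_rooted_binary_trees(len(tree))
    for child in tree:
        total *= num_binary_refinements(child)
    return total

# Complete 3-ary consensus tree of depth 2 (9 leaves).
consensus = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
all_trees = num_rooted_binary_trees(9)                 # 2027025
refinements = num_binary_refinements(consensus)        # 3 ** 4 = 81
```

In this small example, restricting the search to refinements of the consensus tree reduces the candidate set from 2,027,025 trees to 81, and the reduction grows rapidly with the number of taxa.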

Conclusions 

We prove that the deep coalescence consensus tree pro- 
blem satisfies the Pareto property for clusters and 
describe an efficient algorithm that, given a candidate 
solution that does not display the consensus clusters, 
transforms the solution so that it includes all the con- 
sensus clusters and has a lower deep coalescence cost. 
We extend the result and prove that the problem exhi- 
bits optimal substructures based on the strict consensus 
tree of the input gene trees. Based on this property, we 
suggest a new, parallelizable tree search method, in 
which we refine the strict consensus of the input gene 
trees. In contrast to previously proposed heuristics, this 
method guarantees that the proposed solution will con- 
tain the Pareto clusters. Also, as our experiments 
demonstrate, this method can greatly improve the speed 
of deep coalescence tree heuristics, potentially enabling 
efficient and effective estimates from input with thou- 
sands of taxa. 